
Add auto cv pipeline multi buffer enable and preload support #620

Open
zhangstevenunity wants to merge 13 commits into main from codex/cv-preload-scope-marking
Conversation

@zhangstevenunity
Collaborator

No description provided.

zhangstevenunity and others added 12 commits May 2, 2026 20:20
PR #615 added the SyncCodegen scaffolding for multi-buffer set/wait_flag_dyn
ops, but the upstream analysis path was inert: GetEventIdNum hard-coded
return 1, and PTOIRTranslator dropped every pto.pointer_cast addr operand
beyond the first. The N>1 multi-buffer code path was therefore unreachable.

This change makes the PlanMemory -> InsertSync handshake actually work:

- PTOIRTranslator::UpdatePointerCastOpMemInfo now translates *all* N
  pto.pointer_cast i64 addr operands into BaseMemInfo.baseAddresses
  (constants -> bit offsets, non-constants -> hasVariableAddress=true).
  Subview/alias deltaOffset is applied to every slot, not just slot 0.

- MemoryDependentAnalyzer gains getMultiBufferSlotCount(a, b) implementing
  HIVM's per-slot geometry check: same-index slots must overlap (real
  back-edge dep on the same physical buffer) and different-index slots must
  NOT overlap (so consecutive iterations land in disjoint physical buffers).

- InsertSyncAnalysis::GetEventIdNum is rewritten to follow HIVM semantics:
  every dependent pair must be multi-buffer-eligible, all pairs must agree
  on N, and every involved buffer must hang off the same scf.for.

Verified end-to-end: a `pto.multi_buffer = 2` alloc inside scf.for now
emits 2 reserved event ids, an `iv mod 2` arith.select chain, and
`pto.{set,wait}_flag_dyn` ops driven by the selected idx. New regression
guards this in test/lit/pto/multi_buffer_insert_sync_dyn_event_id.pto.
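
The per-slot geometry rule above can be sketched in plain C++. This is a minimal illustration only: `Slot`, `overlaps`, and this free-function form of `getMultiBufferSlotCount` are assumptions, not the actual MemoryDependentAnalyzer interface, which works on `BaseMemInfo.baseAddresses`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch of the per-slot geometry rule: buffers a and b are
// multi-buffer-eligible with N slots only when same-index slots overlap
// (a real back-edge dep on the same physical buffer) and different-index
// slots are disjoint (consecutive iterations land in different buffers).
struct Slot { int64_t base; int64_t size; };

static bool overlaps(const Slot &x, const Slot &y) {
  return x.base < y.base + y.size && y.base < x.base + x.size;
}

// Returns N when the rule holds for every slot pair, 1 (single-buffer)
// otherwise.
int getMultiBufferSlotCount(const std::vector<Slot> &a,
                            const std::vector<Slot> &b) {
  if (a.size() != b.size() || a.size() < 2)
    return 1;
  for (size_t i = 0; i < a.size(); ++i)
    for (size_t j = 0; j < b.size(); ++j) {
      bool ov = overlaps(a[i], b[j]);
      if (i == j && !ov) return 1;  // same index must alias
      if (i != j && ov) return 1;   // different index must be disjoint
    }
  return static_cast<int>(a.size());
}
```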

Co-Authored-By: Claude <noreply@anthropic.com>
Builds on P0 to make the multi-buffer pipeline robust under realistic
pressure:

- SyncEventIdAllocation: HIVM-style TryFallbackMultiBuffer. When the
  event-id pool can't satisfy N, fall back to N=2 first (when N is even
  and >2) then to 1, instead of silently returning with an empty
  eventIds vec. Both set/wait siblings get their `eventIdNum` updated
  in lockstep so SyncCodegen takes the same code path on each side.

- PTOPlanMemory: add `StorageEntry::isMultiBufferSlot` flag and rewrite
  `UpdateBuffer2Offsets` to walk via primaries and emit slots in
  declared `relationOtherBuffers` order. This makes the slot-ordering
  contract that EnableMultiBuffer relies on (`offsets[i]` <-> iv mod N
  selector index `i`) explicit and verifies it via a runtime assertion.

- PTOEnableMultiBuffer: add scope guard (skip non-VEC/MAT casts where
  the iv-mod-N selector is not meaningful) and loop-invariance guard
  (skip casts whose addrs are not loop-invariant - hoisting them above
  the for loop would break SSA dominance).

- New regression at test/lit/pto/multi_buffer_n4_insert_sync.pto: N=4
  slot ordering [0, 1024, 2048, 3072] in pto.pointer_cast, 3-way
  arith.select chain, and 4 distinct event ids drained at function end.
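
The event-id fallback ladder described in the first bullet can be sketched as follows. The name `tryFallbackMultiBuffer` and this free-function shape are assumptions for illustration, not SyncEventIdAllocation's real signature.

```cpp
#include <cassert>

// Minimal sketch of the fallback ladder: when the event-id pool cannot
// supply N ids, degrade N -> 2 first (only when N is even and > 2),
// then N -> 1, instead of returning with an empty eventIds vector.
int tryFallbackMultiBuffer(int n, int freeEventIds) {
  while (n > freeEventIds && n > 1) {
    if (n > 2 && n % 2 == 0)
      n = 2;  // smallest shape that still double-buffers
    else
      n = 1;  // single-buffer: plain set/wait flag pair
  }
  return n;
}
```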

Co-Authored-By: Claude <noreply@anthropic.com>
Final hardening pass tightening multi-buffer correctness around the edges
PR-615 and earlier P0/P1 commits left ambiguous.

- InsertSyncAnalysis: GetEventIdNum now takes the back-edge scf.for and
  requires every involved buffer's enclosing loop to *equal* that loop.
  Previously a multi-buffer alloc nested inside an inner loop could
  silently rotate slots on the wrong induction variable when a back-edge
  on an outer loop carried the dep. Forward deps unconditionally collapse
  to single-buffer.

- PTOEnableMultiBuffer: cache `iv mod N` per (loop, N) so several
  multi-buffer pointer_casts sharing a loop reuse one counter, mirroring
  SyncCodegen::loop2BufferCounter. Avoids redundant arith.remui +
  constant ops in the loop body.

- PTOPlanMemory:
  * `MemPlan::emitMultiBufferError` + `multiBufferDiagnosticEmitted_`
    flag converts multi-buffer-specific `report_fatal_error` calls
    (slot-order mismatch, pong outline placement failure) into
    diagnostics that bubble through `plan()` as `failure()`. This lets
    the existing PlanMemoryPass retry loop re-seed instead of aborting
    the compiler under heavy multi-buffer memory pressure.
  * `VerifyConflictStage1` now enumerates *every* historical
    multi-buffer slot offset (via `CollectMultiRelationPongAnchors`)
    as a SPEC_LEVEL_1 reuse anchor candidate. PR-615 only used the last
    slot, silently dropping reuse for N > 2.

- New regression at test/lit/pto/multi_buffer_nested_loop.pto: nested
  scf.for with multi-buffer alloc in the inner loop must rotate slots on
  the inner iv, not the outer one.

Co-Authored-By: Claude <noreply@anthropic.com>
`VerifyConflictStage2`'s contract (per its own comment) is "block reuse
only when buffers (a) share a parent loop AND (b) have a DMA pipe
conflict". `PipeConflictInSameLoop` violated that on two counts:

1. After confirming same parent loop, it returned true unconditionally
   instead of querying `PipeConflict` for the actual DMA pipe-conflict
   relation. SPEC_LEVEL_2 was therefore stricter than SPEC_LEVEL_3
   semantically: any same-loop pair was rejected, even when no DMA
   pipe conflict existed.

2. `GetBufferParentLoop` returns nullptr for top-level buffers; two
   top-level buffers compared parentLoop1 == parentLoop2 == nullptr,
   so the "different loops -> allow reuse" early-return was bypassed
   and the function fell through to "return true". Almost every
   cross-buffer pair at function scope was getting marked as
   conflicting, blocking valid reuse and causing local-memory
   planning to fail/overflow in cases that previously fit.

Fix: reject the nullptr case explicitly, then fall through to the
real `PipeConflict` query before declaring a conflict.

166/166 lit tests still pass; the change can only widen the set of
allowed reuses, never narrow it.
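
The fixed predicate can be summarized in a standalone sketch. `Buffer` and `Loop` here are stand-ins for PlanMemory's real buffer records, and `pipeConflict` is a placeholder for the actual DMA pipe-conflict query.

```cpp
#include <cassert>

// Stand-in types; the real code operates on PlanMemory's buffer records.
struct Loop {};
struct Buffer { Loop *parentLoop; int pipe; };

// Placeholder for the real DMA pipe-conflict relation.
static bool pipeConflict(const Buffer &a, const Buffer &b) {
  return a.pipe != b.pipe;
}

// SPEC_LEVEL_2: block reuse only when the buffers share a *non-null*
// parent loop AND actually conflict on a DMA pipe.
bool pipeConflictInSameLoop(const Buffer &a, const Buffer &b) {
  if (a.parentLoop != b.parentLoop)
    return false;              // different loops -> reuse allowed
  if (!a.parentLoop)
    return false;              // both top-level: nullptr is not "same loop"
  return pipeConflict(a, b);   // only a real pipe conflict blocks reuse
}
```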

Co-Authored-By: Claude <noreply@anthropic.com>
After merging upstream/main, multi-buffer-enabled loops emitted spurious
`pto.barrier <PIPE_MTE2>` and `pto.barrier <PIPE_MTE3>` ops in addition
to the multi-buffer `set_flag_dyn` / `wait_flag_dyn` pair, e.g.:

  pto.wait_flag_dyn[<PIPE_MTE3>, <PIPE_MTE2>, %2]
  pto.barrier <PIPE_MTE2>            <- redundant
  pto.tload                           ; (MTE2)
  pto.set_flag[<PIPE_MTE2>, <PIPE_MTE3>, <EVENT_ID0>]
  pto.wait_flag[<PIPE_MTE2>, <PIPE_MTE3>, <EVENT_ID0>]
  pto.barrier <PIPE_MTE3>            <- redundant
  pto.tstore                          ; (MTE3)
  pto.set_flag_dyn[<PIPE_MTE3>, <PIPE_MTE2>, %2]

Both barriers came from `InsertSyncOperation`'s same-pipe back-edge
branch unconditionally emitting `PIPE_BARRIER` for any cross-iteration
dep on a single pipe. Two of these were truly redundant:

1. Same-pipe back-edge whose dep is multi-buffer eligible. The whole
   point of multi-buffer is that consecutive iterations land in
   disjoint physical slots, so the cross-iter "same-address" dep is
   fundamentally false; the multi-buffer dyn flag pair already does
   the cross-iter ordering. No barrier needed.

2. Same-pipe back-edge on a DMA pipe (MTE1/MTE2/MTE3/MTE4/MTE5). DMA
   pipes are simple in-order command queues; the hardware itself
   serializes consecutive ops on the same DMA pipe across iterations.
   PIPE_BARRIER on a DMA pipe is a no-op and just clutters the IR.

Fix: in `InsertSyncOperation`, compute the multi-buffer eventIdNum once
(also resolves the back-edge scf.for so the cross-pipe path can reuse
it - small refactor of the existing logic) and skip the barrier when
either condition above holds. PIPE_M / PIPE_V keep the conservative
PIPE_BARRIER to preserve the "bar_m" / "bar_v" intra-pipe semantics
that higher-level frontends rely on (the existing comment in
`IsNoNeedToInsertSync` calls this out).
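
The two skip conditions boil down to a small predicate. The sketch below uses an assumed pipe enumeration and helper names for illustration; it is not the actual InsertSyncOperation code.

```cpp
#include <cassert>

// Assumed pipe enumeration; ordered so the MTE range check below works.
enum Pipe { PIPE_M, PIPE_V, PIPE_MTE1, PIPE_MTE2, PIPE_MTE3, PIPE_MTE4, PIPE_MTE5 };

// DMA pipes are in-order command queues: the hardware already serializes
// consecutive ops on the same DMA pipe across iterations.
static bool isDmaPipe(Pipe p) { return p >= PIPE_MTE1 && p <= PIPE_MTE5; }

// Same-pipe back-edge dep: emit PIPE_BARRIER only when neither
// redundancy condition applies.
bool needsSamePipeBarrier(Pipe pipe, int eventIdNum) {
  if (eventIdNum > 1)
    return false;  // multi-buffer dyn flag pair already orders cross-iter
  if (isDmaPipe(pipe))
    return false;  // in-order DMA queue: barrier is a no-op
  return true;     // PIPE_M / PIPE_V keep the conservative barrier
}
```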

168/168 lit pass, including all multi-buffer regressions and the
upstream-merged comm/sync tests.

Co-Authored-By: Claude <noreply@anthropic.com>
…ehmann-807dab

# Conflicts:
#	lib/PTO/Transforms/CMakeLists.txt
#	tools/ptoas/ptoas.cpp
Wire multi-buffer support into the GraphSyncSolver path so it matches the
behaviour the InsertSync path got in earlier P0/P1/P2 commits. Mirrors the
intra-core subset of `bishengir/lib/Dialect/HIVM/Transforms/GraphSyncSolver`
(`SyncSolver::getEventIdInfo`, `getMultiBufferEventIdInfo`,
`getMultiBufferLoop`, `SyncSolverCodeGen::getMultiBufferSelectOp`).

Detection (Utility.h, SyncSolver.h/.cpp):
- `EventIdInfo` (eventIdNum + multibufferLoop) carried per ConflictPair.
- `getMultiBufferLoop` finds the common scf.for whose iv drives the slot
  rotation; reuses the per-slot overlap helper added in P0
  (MemoryDependentAnalyzer::getMultiBufferSlotCount), so the same
  same-index-overlap / different-index-disjoint geometry rule applies.
- `getMultiBufferEventIdInfo` requires every dependent pair to agree on N
  and share the same scf.for; failure -> single-buffer.
- `getEventIdInfo` is the top-level wrapper with the backward-only gate
  (mirrors HIVM's "only optimise back-edge deps").

Solver wiring (handleSetWaitConflict):
- Threads rwOp1/rwOp2 in so MB info is computable.
- Allocates N event ids via the existing Welsh-Powell coloring node
  (EventIdNode already supports eventIdNum > 1).
- Falls back to N=1, then PIPE_ALL barrier, if N ids cannot be coloured.

Codegen (SyncSolverCodeGen.h/.cpp):
- New `emitMultiBufferSetWait` emits the HIVM/InsertSync output shape:
  pre-loop N pto.set_flag (queue prime), in-body iv-mod-N counter +
  N-way arith.select chain + pto.{wait,set}_flag_dyn, post-loop N
  pto.wait_flag (queue drain). The dyn ops live INSIDE the loop body so
  they share the selector's dominance (GSS's default backward-sync
  anchors are at the loop boundary, which works for single-buffer but
  not for a per-iteration selector).
- `loop2BufferCounter_` cache reuses one `iv mod N` across multiple
  ConflictPairs sharing a (loop, N) tuple.
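
What the emitted N-way arith.select chain computes can be read as the following sketch: given the per-slot event ids, the chain reduces to `eventIds[iv mod N]`. The function name and shape are illustrative, not SyncSolverCodeGen's real API.

```cpp
#include <cassert>
#include <vector>

// Sketch of the N-way selector chain semantics: an iv-mod-N counter
// feeds a cascade of selects, one level per slot, whose net effect is
// indexing eventIds by the counter.
int selectEventId(const std::vector<int> &eventIds, int iv) {
  int idx = iv % static_cast<int>(eventIds.size());  // iv-mod-N counter
  int result = eventIds.back();                      // tail of the chain
  for (int i = static_cast<int>(eventIds.size()) - 2; i >= 0; --i)
    result = (idx == i) ? eventIds[i] : result;      // one select per level
  return result;
}
```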

Test:
- New `multi_buffer_gss_dyn_event_id.pto` is the GSS counterpart of the
  existing InsertSync N=2 regression; checks pre-loop primes, the
  selector chain, dyn set/wait, and post-loop drains.

191/191 lit pass.

Co-Authored-By: Claude <noreply@anthropic.com>
@zhangstevenunity changed the title from "Addcv preload scope marki" to "Add auto cv pipeline mulit buffer enable and preload support" on May 5, 2026
@zhangstevenunity zhangstevenunity marked this pull request as ready for review May 5, 2026 07:44
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements a CV Preload optimization framework for PTOAS, introducing new IR operations and passes to facilitate multi-buffering and software pipelining between Cube and Vector kernels. The implementation covers automatic multi-buffer marking, scope identification, and stage-guarded loop expansion, alongside significant updates to memory planning and synchronization solvers. Technical feedback highlights several critical areas for refinement: ensuring deterministic IR generation by avoiding DenseMap iteration, permitting pure operations like constants within loop bodies to align with the design specification, and addressing logic gaps in scope identification that currently miss leading or trailing operations. Furthermore, the reviewer recommends strengthening conflict analysis for top-level buffers to mitigate potential hardware hazards.

Comment on lines +288 to +290
SmallVector<LocalBufferBinding> orderedBindings;
orderedBindings.reserve(localBindings.size());
for (const auto &it : localBindings)


medium

The iteration over localBindings (which is a DenseMap) is non-deterministic. This causes the order of cloned PointerCastOp and BindTileOp operations at the beginning of the expanded loop body to vary across different compiler runs. While the resulting code is functionally equivalent, non-deterministic IR generation makes debugging, testing, and IR diffing significantly more difficult.

Consider using a MapVector or maintaining a separate SmallVector to track the insertion order of bindings during the walk.

Comment on lines +318 to +321
if (!scope) {
return op.emitOpError(
"non-CV-scope op cannot be preserved by pto-cv-create-preload yet");
}

medium

This check is overly restrictive. It causes the pass to fail if the loop body contains any operation not wrapped in a pto.cv.scope, including pure operations like arith.constant or index arithmetic. The design document (Section 7.8) explicitly mentions that constants and index arithmetic should be allowed and cloned for each stage.

Restricting the loop body to only CVScopeOp makes the pass fragile and forces users to wrap trivial setup code into scopes unnecessarily.

Comment on lines +290 to +297
if (action.isTPop) {
if (pending.active)
includeMissingThrough(pending, op.getPrevNode());
collectScopeIfValid(pending, core, scopes);
pending.input = action.input;
includeRange(pending, &op, &op);
segmentStart = &op;
continue;

medium

In collectScopesInFor, when the first TPipe operation encountered is a tpop, any operations preceding it in the loop body (e.g., induction variable adjustments or constant definitions) are not included in any scope. Since CVCreatePreloadPass requires all operations in the loop body to be wrapped in scopes, this gap will cause the expansion pass to fail on valid IR.

The logic should ensure that leading operations are either pulled into the first scope or that the identification process covers the entire block range.

Comment on lines +341 to +344
if (pending.active)
includeMissingThrough(pending, terminator ? terminator->getPrevNode()
: nullptr);
collectScopeIfValid(pending, core, scopes);

medium

Similar to the issue with leading operations, if the loop body contains non-TPipe operations after the last identified scope has been closed (i.e., pending.active is false), these trailing operations are missed and left unwrapped. This will lead to expansion failures in the subsequent pto-cv-create-preload pass.

Comment on lines +2565 to +2567
if (!parentLoop1 || !parentLoop2) {
return false;
}

medium

In PipeConflictInSameLoop, returning false when both parentLoop1 and parentLoop2 are nullptr allows top-level buffers (defined at function scope) to reuse the same memory offsets even if they have conflicting pipe usage and overlapping lifetimes.

While SPEC_LEVEL_2 is intended to be less restrictive than SPEC_LEVEL_3, top-level buffers are effectively in the same execution scope. Allowing them to alias while being used by conflicting hardware units (e.g., MTE and Vector) simultaneously could lead to hardware hazards. It would be safer to check PipeConflict if both are nullptr.

@reedhecre

reedhecre commented May 5, 2026

Codex Review

This comment is automatically updated by the review bot.

  • PR: #620 Add auto cv pipeline mulit buffer enable and preload support
  • Author: zhangstevenunity
  • Base/Head: main / codex/cv-preload-scope-marking
  • Head SHA: ef2b6c6a3f71
  • Trigger: new commits pushed to the PR
  • Generated At: 2026-05-06T11:54:25Z
  • Previous Head SHA: aa5ff4602e09
  • Status: completed

Summary

PR #620 introduces two P1 correctness regressions in the new CV multi-buffer/preload pipeline and one P2 error-handling bug in PlanMemory.

Findings

  1. P1 Default pipeline auto-marks CV multi-buffering without also enabling buffer-slot rotation tools/ptoas/ptoas.cpp:1140

PTOCVAutoMarkMultiBufferPass and PTOCVMarkPreloadScopesPass are now always added to the default pipeline, but PTOEnableMultiBufferPass still only runs when --enable-multi-buffer-lowering or --enable-cv-create-preload is set. On a plain --enable-insert-sync / --enable-graph-sync-solver build, PlanMemory can therefore emit variadic pto.pointer_casts, the sync analyzers will use all N planned addresses to choose eventIdNum > 1, and SyncCodegen will emit wait_flag_dyn / set_flag_dyn rotation. EmitC later lowers the same pto.pointer_cast with adaptor.getAddrs()[0] only, so the runtime rotates event IDs but always reuses slot 0. That is a real miscompile for existing sync-only command lines on CV kernels, and it also changes memory planning even when the user never opted into multi-buffer/preload.

  2. P1 Create-preload leaves hoisted bind_tile-backed multi-buffer casts unlowered lib/PTO/Transforms/PTOEnableMultiBuffer.cpp:146

PTOCVCreatePreloadPass hoists cloned pointer_cast + bind_tile pairs above the rewritten loop. inferRotationLoopFromUses() only inspects the direct users of the pointer_cast; in the hoisted case that direct user is the hoisted BindTileOp, which is itself outside any scf.for, so the helper returns nullptr and the pass skips lowering entirely. The remaining variadic pto.pointer_cast then reaches EmitC, which again uses address 0 only. That means different preload stages alias the same physical local-buffer slot, breaking the advertised --enable-cv-create-preload path for normal tilebuf/bind_tile based kernels.

  3. P2 PlanMemory ignores multi-buffer offset-generation errors because the failure flag is checked too early lib/PTO/Transforms/PTOPlanMemory.cpp:1528

emitMultiBufferError() is meant to make the new retry loop reject a bad multi-buffer plan, but plan() checks multiBufferDiagnosticEmitted_ before calling UpdateBuffer2Offsets(). The new invariants added in UpdateBuffer2Offsets() (null relation slot, offset count mismatch) therefore set the flag too late: plan() still returns success and the pass continues with a partially built buffer2Offsets map. That defeats the retry/fail-fast logic and can feed invalid addresses into AllocToPointerCast instead of stopping cleanly.

P0 fixes (correctness, untested edges):
* CVCreatePreload: use llvm::MapVector for LocalBufferBinding so the cloned
  per-stage pointer_cast / bind_tile order is deterministic across builds
  (DenseMap iteration order had been leaking pointer-hash into the IR).
* Multi-buffer slot index now computes ((iv - lb) / step) mod N instead of
  iv mod N. The previous form silently dropped slots when gcd(step, N) > 1
  (e.g. step=2/N=4 only ever selected {0,2}). Fixed in three codegen paths
  (PTOEnableMultiBuffer, GSS SyncSolverCodeGen, InsertSync SyncCodegen);
  step=1, lb=0 still takes the single-remui fast path.
* GraphSyncSolver multi-buffer codegen now preserves the original sync
  anchors. cp->op1 / cp->op2 are syncIR-ordered, not MLIR-lexical-ordered;
  resolve via Operation::isBeforeInBlock so wait_flag_dyn lands before the
  iteration's first user and set_flag_dyn after the iteration's last user.
  Anchors in different blocks (e.g. nested scf.if) fall back to the prior
  body-start / before-terminator placement.
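
The step!=1 slot-index fix can be checked with a small worked example (standalone arithmetic, not the codegen itself): with lb=0, step=2, N=4, `iv mod N` only ever selects slots {0, 2}, while the normalized index visits all four.

```cpp
#include <cassert>

// Old form: drops slots whenever gcd(step, N) > 1.
int slotNaive(int iv, int n) { return iv % n; }

// Fixed form: normalize the induction variable to a trip count first,
// then rotate: ((iv - lb) / step) mod N.
int slotFixed(int iv, int lb, int step, int n) {
  return ((iv - lb) / step) % n;
}
```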

P1 fixes (semantic gaps):
* SyncSolver::getMultiBufferEventIdInfo now matches HIVM and PTO InsertSync:
  any dep pair with n<2 collapses the whole pair to single-buffer rather
  than being silently skipped.
* CVMarkPreloadScopes: ScopeClass key now includes parent scf.for so two
  unrelated loops sharing the same (core, input, output) pipe stay in
  separate classes; fan-out / fan-in along a logical pipe is detected and
  emits a diagnostic instead of silently dropping siblings.
* CVMarkPreloadScopes: TPipe ops nested under scf.if / scf.for inside the
  scope-recognition for now emit a warning and skip auto-marking, instead
  of being missed by the direct-children walk.
* CVCreatePreload: stage-rotated multi-address pointer_casts are hoisted
  above the rotation loop instead of being duplicated max*N times in the
  body. PTOEnableMultiBuffer learns to infer the rotation loop from a
  loop-invariant cast's users so the lowering still finds it.
* CVMarkPreloadScopes::canWrapNoResultScope walks every op (including ops
  inside nested regions of the moved range) when checking for SSA escapes,
  not just the top-level moved ops.

Tests:
* New regression test/lit/pto/multi_buffer_step_not_one.pto covers the
  step!=1 slot-index path.
* Existing cv_create_preload_scope_pipeline.pto updated for the hoisted
  stage pointer_cast layout.
* Full lit suite (194 tests) passes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
